Ensemble learning involves a combination of techniques that allows multiple machine learning models, called base learners (or, sometimes, weak learners), to consolidate their predictions and output a single, optimal prediction, given their respective inputs and outputs.
error = bias^2 + variance + irreducible error
As is evident, the more base learners we add to the ensemble, the more accurate the ensemble becomes, under the condition that each learner is uncorrelated with the others. Of course, this is increasingly difficult to achieve. Furthermore, the law of diminishing returns applies: each new uncorrelated base learner contributes less to the overall error reduction than the previously added one. The following figure shows the ensemble error percentage for a number of uncorrelated base learners. As is evident, the greatest reduction occurs when the first few uncorrelated base learners are added.
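The figure's values can be reproduced with a short sketch. Assuming N independent base learners, each with the same error rate ε (here ε = 0.3, an arbitrary choice for illustration), the probability that a majority vote is wrong follows the binomial distribution:

```python
from math import comb

def ensemble_error(n, eps):
    # Probability that a majority of n independent base learners,
    # each with error rate eps, is wrong (i.e. the ensemble errs)
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in [1, 3, 5, 7, 9, 11]:
    print(f'{n} learners -> ensemble error {ensemble_error(n, 0.3):.4f}')
```

The error drops fastest for the first few added learners, illustrating the diminishing returns described above.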
Although bias and variance have theoretical formulas, it is difficult to calculate their actual values. A simple way to estimate them empirically is with learning and validation curves.
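As a rough sketch of the empirical approach (the dataset and the max_depth grid are arbitrary choices for illustration), a validation curve contrasts train and validation scores: a high train score with a much lower validation score suggests high variance, while both scores being low suggests high bias:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
depths = [1, 2, 4, 8]
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name='max_depth', param_range=depths, cv=5)
for d, t, v in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # A widening gap between train and validation R2 indicates variance
    print(f'max_depth={d}: train R2={t:.2f}, validation R2={v:.2f}')
```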
Ensemble methods are divided into two major classes or taxonomies: generative and non-generative methods.
In this book, we will cover two main non-generative methods: voting and stacking.
The main generative methods that we will cover in this book are bagging, boosting, and random forests.
Ensemble methods can be used for various reasons, mainly to reduce the bias and/or variance of the final predictions.
Stacking is a form of meta-learning. The main idea is that we use base learners in order to generate metadata for the problem's dataset and then utilize another learner called a meta-learner, in order to process the metadata.
Base learners are considered to be level 0 learners, while the meta-learner is considered a level 1 learner. In other words, the meta-learner is stacked on top of the base learners, hence the name stacking.
A more intuitive way to describe the ensemble is through an analogy with voting. In voting, we combine a number of base learners' predictions in order to increase their performance. In stacking, instead of explicitly defining the combination rule, we train a model that learns how to best combine the base learners' predictions.
The preferred approach is to utilize K-fold cross-validation. For each fold, the base learners are trained on the remaining K-1 folds and predict on the held-out fold, generating 100/K percent of the final training metadata. By repeating the process K times, once for each fold, we generate metadata for the whole training dataset.
It is generally a good idea to mix different learning algorithms, in order to capture both linear and non-linear relationships between the features themselves, as well as between the features and the target variable.
Generally, the meta-learner should be a relatively simple machine learning algorithm, in order to avoid overfitting.
Furthermore, additional steps should be taken in order to regularize the meta-learner, for example, by using a regularized model such as ridge regression.
If there is a need for more complex models in order to increase the ensemble's predictive performance, a multi-level stack could be used, in which the number of models and each individual model's complexity reduces as the stack's level increases:
Another important characteristic of the meta-learner is the ability to handle correlated inputs, and especially to not make any assumptions about the independence of features from one another, as naive Bayes classifiers do. The inputs to the meta-learner (the metadata) will be highly correlated. This happens because all base learners are trained to predict the same target; thus, their predictions will come from approximations of the same function. Although the predicted values will vary, they will be close to each other.
# --- SECTION 1 ---
# Libraries and data loading
from sklearn.datasets import load_diabetes
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold
from sklearn import metrics
import numpy as np
diabetes = load_diabetes()
train_x, train_y = diabetes.data[:400], diabetes.target[:400]
test_x, test_y = diabetes.data[400:], diabetes.target[400:]
print(len(train_x), len(test_x))
# they are numpy arrays
# --- SECTION 2 ---
# Create the ensemble's base learners and meta-learner
# Append base learners to a list for ease of access
base_learners = []
knn = KNeighborsRegressor(n_neighbors=5)
dt = DecisionTreeRegressor(max_depth=4, random_state=13)
ridge = Ridge()
base_learners.append(knn)
base_learners.append(dt)
base_learners.append(ridge)
meta_learner = LinearRegression()
# --- SECTION 3 ---
# Create the training metadata
# Create variables to store metadata and their targets
meta_data = np.zeros((len(base_learners), len(train_x)))
meta_targets = np.zeros(len(train_x))
# Create the cross-validation folds
KF = KFold(n_splits=5)
meta_index = 0
for train_indices, test_indices in KF.split(train_x):
    # Train each learner on the K-1 folds
    # and create metadata for the Kth fold
    for i in range(len(base_learners)):
        learner = base_learners[i]
        learner.fit(train_x[train_indices], train_y[train_indices])
        predictions = learner.predict(train_x[test_indices])
        meta_data[i][meta_index:meta_index+len(test_indices)] = predictions
    meta_targets[meta_index:meta_index+len(test_indices)] = \
        train_y[test_indices]
    meta_index += len(test_indices)
# Transpose the metadata to be fed into the meta-learner
meta_data = meta_data.transpose()
'''
For the test set, we do not need to split it into folds.
We simply train the base learners on the whole train set and predict on the test set.
Furthermore, we evaluate each base learner and store the evaluation metrics, in order to compare them with the ensemble's performance.
As this is a regression problem, we use R-squared and Mean Squared Error (MSE) as evaluation metrics
'''
# --- SECTION 4 ---
# Create the metadata for the test set and evaluate the base learners
test_meta_data = np.zeros((len(base_learners), len(test_x)))
base_errors = []
base_r2 = []
for i in range(len(base_learners)):
    learner = base_learners[i]
    learner.fit(train_x, train_y)
    predictions = learner.predict(test_x)
    test_meta_data[i] = predictions
    err = metrics.mean_squared_error(test_y, predictions)
    r2 = metrics.r2_score(test_y, predictions)
    base_errors.append(err)
    base_r2.append(r2)
test_meta_data = test_meta_data.transpose()
# --- SECTION 5 ---
# Fit the meta-learner on the train set and evaluate it on the test set
meta_learner.fit(meta_data, meta_targets)
ensemble_predictions = meta_learner.predict(test_meta_data)
err = metrics.mean_squared_error(test_y, ensemble_predictions)
r2 = metrics.r2_score(test_y, ensemble_predictions)
# --- SECTION 6 ---
# Print the results
print('MSE | R2 |Name')
print('-'*20)
for i in range(len(base_learners)):
    learner = base_learners[i]
    print(f'{base_errors[i]:.1f}|{base_r2[i]:.2f}|{learner.__class__.__name__}')
print(f'{err:.1f}|{r2:.2f}|Ensemble')
As is evident, R-squared has improved by over 16% over the best base learner (ridge regression), while MSE has improved by almost 20%. This is a considerable improvement.
Alternatively, scikit-learn's StackingRegressor class can be used to achieve the same result.
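A minimal sketch with StackingRegressor, mirroring the ensemble above; results may differ slightly from the manual implementation, as the class refits the base learners internally:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics

diabetes = load_diabetes()
train_x, train_y = diabetes.data[:400], diabetes.target[:400]
test_x, test_y = diabetes.data[400:], diabetes.target[400:]

stack = StackingRegressor(
    estimators=[('knn', KNeighborsRegressor(n_neighbors=5)),
                ('dt', DecisionTreeRegressor(max_depth=4, random_state=13)),
                ('ridge', Ridge())],
    final_estimator=LinearRegression(),
    cv=5)  # K-fold CV generates the training metadata internally
stack.fit(train_x, train_y)
r2 = metrics.r2_score(test_y, stack.predict(test_x))
print('StackingRegressor R2: %.2f' % r2)
```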
In essence, we are simulating how a statistic would behave if we collected many samples from the original population. In order to create bootstrap samples, we resample with replacement (each instance may be selected multiple times) from our original sample.
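A minimal sketch of bootstrapping (the normal sample is an arbitrary stand-in for real data): we estimate the standard error of the mean by resampling with replacement many times and measuring the spread of the resampled means:

```python
import numpy as np

rng = np.random.default_rng(0)
original = rng.normal(loc=50, scale=10, size=100)  # the "original sample"
boot_means = []
for _ in range(1000):
    # Resample with replacement: each instance may be selected multiple times
    boot = rng.choice(original, size=len(original), replace=True)
    boot_means.append(boot.mean())
# The spread of the bootstrap means approximates the standard error
# of the mean (theoretically about 10/sqrt(100) = 1 here)
print('std of bootstrap means: %.2f' % np.std(boot_means))
```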
Bagging makes use of bootstrap samples in order to train an array of base learners. It then combines their predictions using voting. The motivation behind this method is to produce diverse base learners by diversifying the train sets.
Creating base learners
The procedure is depicted as follows:
Creating base learners through bagging
In fact, if S is a sample of n numbers drawn randomly with replacement from {1, ..., n}, then S contains on average approximately (1 − 1/e)n ≈ 0.632n unique elements.
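A quick simulation supports this (n = 10,000 is an arbitrary size):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
# Fraction of unique elements in a size-n sample drawn with replacement
fractions = [len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(20)]
mean_unique = np.mean(fractions)
print('average fraction of unique elements: %.3f' % mean_unique)  # ≈ 0.632
```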
For example, using a single decision tree allows for great interpretability, as the decision of each node is readily available. Using a bagging ensemble of 100 trees makes the individual decisions less important, while the collective predictions define the ensemble's final output.
### Classification example -> digits dataset
'''
We will use the first 1500 instances as the train set
and the remaining 297 as the test set.
We will generate 10 bootstrap samples, and consequently 10 decision-tree models.
We will then combine the base predictions using hard voting:
'''
# --- SECTION 1 ---
# Libraries and data loading
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
import numpy as np
digits = load_digits()
train_size = 1500
train_x, train_y = digits.data[:train_size], digits.target[:train_size]
test_x, test_y = digits.data[train_size:], digits.target[train_size:]
# --- SECTION 2 ---
# Create our bootstrap samples and train the classifiers
ensemble_size = 10
base_learners = []
# we sample indices in order to access features and targets
for _ in range(ensemble_size):
    # Sample with replacement from the original train set
    bootstrap_indices = np.random.randint(0, train_size, size=train_size)
    bootstrap_x = train_x[bootstrap_indices]
    bootstrap_y = train_y[bootstrap_indices]
    dt = DecisionTreeClassifier()
    dt.fit(bootstrap_x, bootstrap_y)
    base_learners.append(dt)
# --- SECTION 3 ---
# Predict with the base learners and evaluate them
base_predictions = []
base_accuracy = []
for learner in base_learners:
    predictions = learner.predict(test_x)
    base_predictions.append(predictions)
    acc = metrics.accuracy_score(test_y, predictions)
    base_accuracy.append(acc)
# --- SECTION 4 ---
# Combine the base learners' predictions using hard voting
ensemble_predictions = []
# Find the most voted class for each test instance
for i in range(len(test_y)):
    votes = [0 for _ in range(10)]
    for learner_predictions in base_predictions:
        votes[learner_predictions[i]] = votes[learner_predictions[i]] + 1
    # Find the class with the most votes
    final = np.argmax(votes)
    # Add the class to the final predictions
    ensemble_predictions.append(final)
ensemble_acc = metrics.accuracy_score(test_y, ensemble_predictions)
# --- SECTION 5 ---
# Print the accuracies
print('Base Learners:')
print('-'*30)
for index, acc in enumerate(sorted(base_accuracy)):
    print(f'Learner {index+1}: {acc:.2f}')
print('-'*30)
print('Bagging: %.2f' % ensemble_acc)
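The same ensemble can be sketched with scikit-learn's BaggingClassifier, whose default base estimator is a decision tree (results will differ slightly due to random bootstrapping):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn import metrics

digits = load_digits()
train_size = 1500
train_x, train_y = digits.data[:train_size], digits.target[:train_size]
test_x, test_y = digits.data[train_size:], digits.target[train_size:]

# 10 bootstrapped decision trees, combined by voting internally
bag = BaggingClassifier(n_estimators=10, random_state=0)
bag.fit(train_x, train_y)
acc = metrics.accuracy_score(test_y, bag.predict(test_x))
print('BaggingClassifier: %.2f' % acc)
```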
Boosting aims to combine a number of weak learners into a strong ensemble. It is able to reduce bias, but also variance. Here, weak learners are individual models that perform only slightly better than random. For example, in a classification dataset with two classes and an equal number of instances belonging to each class, a weak learner will be able to classify the dataset with an accuracy slightly above 50%.
Boosting, too, creates a number of weak learners and then combines their predictions. The main difference from bagging is that, instead of creating a number of independent bootstrapped train sets, the weak learners are trained sequentially, with each new learner focusing on the instances that the previous learners handled poorly.
Weighted sampling is a sampling process where each candidate has a corresponding weight, which determines its probability of being sampled. The weights are normalized so that their sum equals one. The normalized weights then correspond to the probability that any individual candidate will be sampled.
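A minimal sketch of weighted sampling with NumPy (the weight values are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([5.0, 1.0, 1.0, 1.0])
p = weights / weights.sum()        # normalize so the weights sum to one
draws = rng.choice(len(p), size=10_000, p=p)
freq = np.bincount(draws, minlength=len(p)) / len(draws)
print(freq)  # the first candidate is drawn about 5/8 = 0.625 of the time
```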
1. A weak classifier (e.g., a decision stump) is built on the training data, based on the weighted samples. Here, each sample's weight indicates how important it is that it be correctly classified. Initially, for the first stump, we give all the samples equal weights.
2. We create a decision stump for each variable and see how well each stump classifies samples to their target classes. For example, with features such as Age, Eating Junk Food, and Exercise, we would check how many samples each individual stump classifies correctly or incorrectly as Fit or Unfit.
3. More weight is assigned to the incorrectly classified samples, so that they're classified correctly in the next decision stump. Weight is also assigned to each classifier based on its accuracy, which means high accuracy = high weight!
4. Reiterate from Step 2 until all the data points have been correctly classified, or the maximum iteration level has been reached.
For regression purposes, the steps are similar.
Boosting algorithms are able to reduce both bias and variance.
For a long time, they were considered immune to overfitting, but in fact they can overfit, although they are extremely robust.
One possible explanation is that the base learners, in order to classify outliers, create very strong and complicated rules that rarely fit other instances.
One disadvantage of many boosting algorithms is that they are not easily parallelized, as the models are created in a sequential fashion.
Furthermore, they pose the usual problems of ensemble learning techniques, such as reduction in interpretability and additional computational costs.
## Adaboost classification for breast cancer classification
# --- SECTION 1 ---
# Libraries and data loading
from copy import deepcopy
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
import numpy as np
bc = load_breast_cancer()
train_size = 400
train_x, train_y = bc.data[:train_size], bc.target[:train_size]
test_x, test_y = bc.data[train_size:], bc.target[train_size:]
np.random.seed(123456)
# Creating the ensemble
ensemble_size = 3
base_classifier = DecisionTreeClassifier(max_depth=1)
# Create the initial weights
data_weights = np.zeros(train_size) + 1/train_size
# Train set indices
indices = [x for x in range(train_size)]
base_learners = []
learners_errors = np.zeros(ensemble_size)
learners_weights = np.zeros(ensemble_size)
'''
For each base learner, we will create a deepcopy of the original classifier, train it on a sample dataset, and evaluate it. First, we create the copy and sample with replacement from the original train set, according to the instances' weights:
'''
# Create each base learner
for i in range(ensemble_size):
    weak_learner = deepcopy(base_classifier)
    # Choose the samples by sampling with replacement;
    # each instance's probability is dictated by its weight
    data_indices = np.random.choice(indices, train_size, p=data_weights)
    sample_x, sample_y = train_x[data_indices], train_y[data_indices]
    # Fit the weak learner and evaluate it on the whole train set
    weak_learner.fit(sample_x, sample_y)
    predictions = weak_learner.predict(train_x)
    errors = predictions != train_y
    corrects = predictions == train_y
    # Calculate the weighted errors
    weighted_errors = data_weights * errors
    # The base learner's error is the sum of the weighted errors
    # (the weights sum to one, so this is the weighted error rate)
    learner_error = np.sum(weighted_errors)
    learners_errors[i] = learner_error
'''
Finally, the learner's weight can be calculated as half the natural logarithm of the weighted accuracy over the weighted error.
In turn, we can use the learner's weight to update the data weights.
For erroneously classified instances, the old weight is multiplied by the natural exponent of the learner's weight.
For correctly classified instances, the negative exponent is used instead.
Finally, the new weights are normalized and the base learner is appended to the base_learners list.
'''
    # The learner's weight
    learner_weight = np.log((1-learner_error)/learner_error)/2
    learners_weights[i] = learner_weight
    # Update the data weights
    data_weights[errors] = data_weights[errors] * np.exp(learner_weight)
    data_weights[corrects] = data_weights[corrects] * np.exp(-learner_weight)
    data_weights = data_weights/sum(data_weights)
    # Save the learner
    base_learners.append(weak_learner)
# --- SECTION 3 ---
# Evaluate the ensemble
ensemble_predictions = []
for learner, weight in zip(base_learners, learners_weights):
    # Calculate the weighted predictions
    prediction = learner.predict(test_x)
    ensemble_predictions.append(prediction*weight)
# The final prediction is the weighted mean of the individual predictions
ensemble_predictions = np.mean(ensemble_predictions, axis=0) >= 0.5
ensemble_acc = metrics.accuracy_score(test_y, ensemble_predictions)
# --- SECTION 4 ---
# Print the accuracy
print('Ada Boosting: %.2f' % ensemble_acc)
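For comparison, a sketch with scikit-learn's AdaBoostClassifier, whose default base estimator is a depth-1 decision tree (a decision stump); the exact accuracy may differ from the manual implementation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn import metrics

bc = load_breast_cancer()
train_size = 400
train_x, train_y = bc.data[:train_size], bc.target[:train_size]
test_x, test_y = bc.data[train_size:], bc.target[train_size:]

# 50 boosted decision stumps (the library default base estimator)
ada = AdaBoostClassifier(n_estimators=50, random_state=123456)
ada.fit(train_x, train_y)
acc = metrics.accuracy_score(test_y, ada.predict(test_x))
print('AdaBoostClassifier: %.2f' % acc)
```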
## Gradient boosting
Gradient boosting is a more generalized boosting framework compared to AdaBoost, which also makes it more complicated and math-intensive. Instead of trying to emphasize problematic instances by assigning weights and resampling the dataset, gradient boosting builds each base learner on the previous learner's errors. Furthermore, gradient boosting typically uses decision trees of varying depths, rather than decision stumps.
The gradient boosting algorithm (for regression purposes) starts by calculating the mean of the target variable for the train set and uses it as an initial prediction. Then, it calculates the difference of each instance's target from the prediction (mean), in order to calculate the error. These errors are also called pseudo-residuals.
1. Define the learning rate (smaller than 1) and the ensemble's size.
2. Calculate the train set's target mean.
3. Using the mean as a very simple initial prediction, calculate each instance's target difference from the mean. These errors are called pseudo-residuals.
4. Build a decision tree, by using the original train set's features and the pseudo-residuals as targets.
5. Make predictions on the train set, using the decision tree (we try to predict the pseudo-residuals). Multiply the predicted values by the learning rate.
6. Add the multiplied values to the previously stored predicted values. Use the newly calculated values as predictions.
7. Calculate the new pseudo-residuals using the calculated predictions.
8. Repeat from Step 4 until the desired ensemble size is achieved.
# --- SECTION 1 ---
# Libraries and data loading
from copy import deepcopy
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
import numpy as np
diabetes = load_diabetes()
train_size = 400
train_x, train_y = diabetes.data[:train_size], diabetes.target[:train_size]
test_x, test_y = diabetes.data[train_size:], diabetes.target[train_size:]
np.random.seed(123456)
# --- SECTION 2 ---
# Create the ensemble
# Define the ensemble's size, learning rate and decision tree depth
ensemble_size = 50
learning_rate = 0.1
base_classifier = DecisionTreeRegressor(max_depth=3)
# Create placeholders for the base learners and each step's prediction
base_learners = []
# Note that the initial prediction is the target variable's mean
previous_predictions = np.zeros(len(train_y)) + np.mean(train_y)
# Create the base learners
for _ in range(ensemble_size):
    # Start by calculating the pseudo-residuals
    errors = train_y - previous_predictions
    # Make a deep copy of the base regressor and train it on the
    # pseudo-residuals
    learner = deepcopy(base_classifier)
    learner.fit(train_x, errors)
    predictions = learner.predict(train_x)
    # Multiply the predictions with the learning rate and add the results
    # to the previous predictions
    previous_predictions += learning_rate*predictions
    # Save the base learner
    base_learners.append(learner)
# --- SECTION 3 ---
# Evaluate the ensemble
# Start with the train set's mean
previous_predictions = np.zeros(len(test_y)) + np.mean(train_y)
# For each base learner predict the pseudo-residuals for the test set and
# add them to the previous prediction,
# after multiplying with the learning rate
for learner in base_learners:
    predictions = learner.predict(test_x)
    previous_predictions = previous_predictions + learning_rate*predictions
# --- SECTION 4 ---
# Print the metrics
r2 = metrics.r2_score(test_y, previous_predictions)
mse = metrics.mean_squared_error(test_y, previous_predictions)
print('Gradient Boosting:')
print('R-squared: %.2f' % r2)
print('MSE: %.2f' % mse)
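The same procedure is implemented by scikit-learn's GradientBoostingRegressor; a sketch with matching hyperparameters (results may differ slightly, as the library applies additional refinements):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import metrics

diabetes = load_diabetes()
train_size = 400
train_x, train_y = diabetes.data[:train_size], diabetes.target[:train_size]
test_x, test_y = diabetes.data[train_size:], diabetes.target[train_size:]

gb = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1,
                               max_depth=3, random_state=0)
gb.fit(train_x, train_y)
r2 = metrics.r2_score(test_y, gb.predict(test_x))
print('GradientBoostingRegressor R2: %.2f' % r2)
```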
Random Forests expand on bagging by inducing randomness not only on each base learner's train samples, but in the features as well.
Bagging is able to produce reasonably uncorrelated trees by diversifying each tree's train set through bootstrapping. But bagging only diversifies the trees by acting on one axis: each set's instances.
There is still a second axis on which we can introduce diversity, the features.
By selecting a subset of the available features during training, the generated base learners can be even more diverse.
In random forests, for each tree and at each node, only a subset of the available features is considered when choosing the best feature/split point combination.
The number of features to consider at each split can be tuned by hand, but one-third of all features for regression problems and the square root of all features for classification problems are considered good starting points.
1. Create a bootstrap train sample
2. Select the node to split
3. Select m features randomly
4. Pick the best feature and split point from m
5. Split the node into two nodes
6. Repeat from step 2 until a stopping criterion is met, such as maximum tree depth
Another method to create trees in a Random Forest ensemble is Extra Trees (extremely randomized trees). The main difference from the previous method is that the feature and split-point combination does not have to be optimal. Instead, a number of split points are randomly generated, one for each available feature, and the best of those generated is selected. The algorithm constructs a tree as follows:
1. Select the number of features m that will be considered at each node, and the minimum number of samples n required to split a node
2. For each base learner, do the following:
    1. Create a bootstrap train sample
    2. Select the node to split (the node must have at least n samples)
    3. Select m features randomly
    4. Randomly generate m split points, with values between the minimum and maximum value of each feature
    5. Select the best of these split points
    6. Split the node into two nodes, and repeat from step 2.2 until there are no available nodes
By creating a number of trees using any valid randomization method, we have essentially created a forest, hence the algorithm's name. After generating the ensemble's trees, their predictions must be combined in order to have a functional ensemble.
This is usually achieved through majority voting for classification problems and through averaging for regression problems. There are a number of hyperparameters associated with Random Forests, such as the number of features to consider at each node split, the number of trees in the forest, and the individual tree's size. As mentioned earlier, a good starting point for the number of features to consider is as follows:
The square root of the number of total features for classification problems
One-third of the number of total features for regression problems
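A sketch of both tree-randomization methods with scikit-learn, on the same digits split used earlier (100 trees and the seed are arbitrary choices):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn import metrics

digits = load_digits()
train_size = 1500
train_x, train_y = digits.data[:train_size], digits.target[:train_size]
test_x, test_y = digits.data[train_size:], digits.target[train_size:]

accs = {}
for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              ExtraTreesClassifier(n_estimators=100, random_state=0)):
    model.fit(train_x, train_y)
    name = model.__class__.__name__
    accs[name] = metrics.accuracy_score(test_y, model.predict(test_x))
    print(f'{name}: {accs[name]:.2f}')
```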
Random Forests provide information about the underlying dataset that most other methods cannot easily provide. A prominent example is the importance of each individual feature in the dataset. One method to estimate feature importance is to use the Gini index for each node of each tree and compare each feature's cumulative value. Another method uses the out-of-bag samples. First, the out-of-bag accuracy is recorded for all base learners. Then, a single feature is chosen and its values are shuffled in the out-of-bag samples. This results in out-of-bag sample sets with the same statistical properties as the original sets, but any predictive power that the chosen feature might have is removed (as there is now zero correlation between the selected feature's values and the target). The difference in accuracy between the original and the partially randomized dataset is used as a measure of the selected feature's importance.
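Both estimation approaches have close analogues in scikit-learn: impurity-based importances are computed during fitting, and sklearn.inspection.permutation_importance shuffles one feature at a time and measures the score drop (note that it permutes the given evaluation set rather than out-of-bag samples):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = load_diabetes(return_X_y=True)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# Impurity-based importances, accumulated over all node splits
print(rf.feature_importances_.round(2))
# Permutation importance: shuffle one feature's values and record the score drop
result = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
print(result.importances_mean.round(2))
```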
Concerning bias and variance, although random forests seem to cope well with both, they are certainly not immune. Bias can appear when the available features are great in number, but only a few are correlated with the target. When using the recommended number of features to consider at each split (for example, the square root of the number of total features), the probability that a relevant feature will be selected can be small. The following graph shows the probability that at least one relevant feature will be selected, as a function of relevant and irrelevant features (when the square root of the number of total features is considered at each split).
Variance can also appear in Random Forests, although the method is sufficiently resistant to it. Variance usually appears when the individual trees are allowed to grow fully. We have previously mentioned that as the number of trees increases, the error approximates a certain limit. Although this claim still holds true, it is possible that the limit itself overfits the data. Restricting the tree size (by increasing the minimum number of samples per leaf or reducing the maximum depth) can potentially help in such circumstances.
Random Forests are a very robust ensemble learning method, able to reduce both bias and variance, similar to boosting. Furthermore, the algorithm's nature allows it to be fully parallelized, both during training, as well as during prediction. This is a considerable advantage over boosting methods, especially when large datasets are concerned. Furthermore, they require less hyperparameter fine-tuning, compared to boosting techniques, especially XGBoost.
The main weaknesses of random forests are their sensitivity to class imbalances, as well as the problem we mentioned earlier, which involves a low ratio of relevant to irrelevant features in the train set. Furthermore, when the data contains low-level non-linear patterns (such as in raw, high-resolution image recognition), Random Forests usually are outperformed by deep neural networks. Finally, Random Forests can be computationally expensive when very large datasets are used combined with unrestricted tree depth.